Our best model is model6.
It has the highest R^2 of the models we fitted, 0.306, meaning that 30.6% of the variance in the natural logarithm of price_4_nights is explained by the independent variables in the model. The factors which influence price_4_nights include the property type, which is split into the following categories: entire rental unit, private room in rental unit, private room in residential home, and other. All of these categories are statistically significant at the 95% confidence level. The other variables in the model are the review score rating, the number of bedrooms, the number of beds, the number of people the Airbnb can accommodate, and the number of bathrooms, all of which are also significant. The last factors which influence the price for 4 nights are the ability to book instantly and the availability. These are also statistically significant at the 95% confidence level, but they have a smaller impact on the price for 4 nights than the other variables.
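To illustrate how these two quantities are read off a log-price regression, here is a minimal sketch on simulated data. The variable names, sample size, and coefficients below are hypothetical and are not taken from model6. Only the mechanics carry over: R^2 comes from `summary()`, and a log-scale coefficient b converts to a percentage effect on price via 100 * (exp(b) - 1).

```r
# Simulated toy data (hypothetical, NOT the Airbnb listings)
set.seed(42)
n <- 200
bedrooms <- sample(1:4, n, replace = TRUE)
instant_bookable <- sample(c(TRUE, FALSE), n, replace = TRUE)

# Log price generated from assumed coefficients plus noise
log_price <- 9 + 0.30 * bedrooms + 0.05 * instant_bookable + rnorm(n, sd = 0.5)

# Fit a linear model on the log of the price
toy_model <- lm(log_price ~ bedrooms + instant_bookable)

# Share of the variance in log price explained by the predictors
r_squared <- summary(toy_model)$r.squared

# Percentage change in price per unit change of each predictor
# (intercept dropped, since it has no such interpretation)
pct_effect <- 100 * (exp(coef(toy_model)[-1]) - 1)

r_squared
pct_effect
```

Reading the coefficients on the percentage scale makes it easier to compare predictors, since a statistically significant coefficient can still correspond to a small price effect.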
In our final group assignment we have analysed data about Airbnb listings and fitted a model to predict the total cost for two people staying 4 nights in an Airbnb in a city. We downloaded the Airbnb data from insideairbnb.com; it was originally scraped from airbnb.com.
We selected Buenos Aires to work on, and we used the vroom::vroom() function to download the Airbnb listing data from the Google sheet provided by Professor Kostis.
Before starting the analysis, we would like to introduce the dataframe to the reader of this document. There are many variables in the dataframe; below is a quick description of some of them, and you can find a data dictionary here.
price: cost per night
property_type: type of accommodation (House, Apartment, etc.)
room_type: type of room offered (e.g. Entire home/apt)
number_of_reviews: Total number of reviews for the listing
review_scores_rating: Average review score (0 - 5 in this dataset)
longitude, latitude: geographical coordinates to help us locate the listing
neighbourhood*: three variables on a few major neighbourhoods in each city
Let’s first take a look at the raw dataframe.
# Allows us to see the various columns in the dataframe
glimpse(listings)
Rows: 18,438
Columns: 74
$ id <dbl> 6283, 11508, 12463, 13095~
$ listing_url <chr> "https://www.airbnb.com/r~
$ scrape_id <dbl> 2.021093e+13, 2.021093e+1~
$ last_scraped <date> 2021-09-29, 2021-09-28, ~
$ name <chr> "Casa Al Sur", "Amazing L~
$ description <chr> "<b>The space</b><br />Th~
$ neighborhood_overview <chr> NA, "AREA: PALERMO SOHO<b~
$ picture_url <chr> "https://a0.muscache.com/~
$ host_id <dbl> 13310, 42762, 48799, 5099~
$ host_url <chr> "https://www.airbnb.com/u~
$ host_name <chr> "Pamela", "Candela", "Mat~
$ host_since <date> 2009-04-13, 2009-10-01, ~
$ host_location <chr> "New York, New York, Unit~
$ host_about <chr> "I'm from Argentina but l~
$ host_response_time <chr> "N/A", "N/A", "N/A", "wit~
$ host_response_rate <chr> "N/A", "N/A", "N/A", "100~
$ host_acceptance_rate <chr> "N/A", "100%", "N/A", "N/~
$ host_is_superhost <lgl> FALSE, TRUE, FALSE, FALSE~
$ host_thumbnail_url <chr> "https://a0.muscache.com/~
$ host_picture_url <chr> "https://a0.muscache.com/~
$ host_neighbourhood <chr> "Balvanera", "Palermo", "~
$ host_listings_count <dbl> 1, 1, 1, 7, 7, 7, 7, 7, 1~
$ host_total_listings_count <dbl> 1, 1, 1, 7, 7, 7, 7, 7, 1~
$ host_verifications <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ neighbourhood <chr> NA, "Buenos Aires, Capita~
$ neighbourhood_cleansed <chr> "Balvanera", "Palermo", "~
$ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N~
$ latitude <dbl> -34.60523, -34.58184, -34~
$ longitude <dbl> -58.41042, -58.42415, -58~
$ property_type <chr> "Entire rental unit", "En~
$ room_type <chr> "Entire home/apt", "Entir~
$ accommodates <dbl> 2, 2, 1, 2, 2, 2, 3, 4, 3~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text <chr> "1 bath", "1 bath", "1 ba~
$ bedrooms <dbl> NA, 1, 1, 1, 1, 1, 1, 1, ~
$ beds <dbl> 1, 1, 1, 1, 2, 2, 3, 3, 1~
$ amenities <chr> "[\"Pool\", \"Heating\", ~
$ price <chr> "$4,930.00", "$6,408.00",~
$ minimum_nights <dbl> 3, 2, 1, 1, 1, 1, 1, 1, 5~
$ maximum_nights <dbl> 30, 1125, 4, 60, 60, 60, ~
$ minimum_minimum_nights <dbl> 3, 2, 1, 1, 1, 1, 1, 1, 5~
$ maximum_minimum_nights <dbl> 3, 2, 1, 1, 1, 1, 1, 1, 5~
$ minimum_maximum_nights <dbl> 30, 1125, 4, 60, 60, 60, ~
$ maximum_maximum_nights <dbl> 30, 1125, 4, 60, 60, 60, ~
$ minimum_nights_avg_ntm <dbl> 3.0, 2.0, 1.0, 1.0, 1.0, ~
$ maximum_nights_avg_ntm <dbl> 30, 1125, 4, 60, 60, 60, ~
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30 <dbl> 30, 0, 30, 30, 30, 30, 30~
$ availability_60 <dbl> 60, 0, 60, 60, 60, 60, 60~
$ availability_90 <dbl> 90, 0, 90, 90, 90, 90, 90~
$ availability_365 <dbl> 365, 148, 365, 365, 365, ~
$ calendar_last_scraped <date> 2021-09-29, 2021-09-28, ~
$ number_of_reviews <dbl> 1, 27, 20, 1, 0, 1, 0, 1,~
$ number_of_reviews_ltm <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0~
$ number_of_reviews_l30d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review <date> 2011-01-31, 2015-01-05, ~
$ last_review <date> 2011-01-31, 2021-04-03, ~
$ review_scores_rating <dbl> 4.00, 4.74, 4.76, 5.00, N~
$ review_scores_accuracy <dbl> 5.00, 4.92, 4.81, 5.00, N~
$ review_scores_cleanliness <dbl> 4.00, 4.85, 4.88, 5.00, N~
$ review_scores_checkin <dbl> 5.00, 4.88, 4.88, 5.00, N~
$ review_scores_communication <dbl> 5.00, 4.96, 4.88, 5.00, N~
$ review_scores_location <dbl> 4.00, 4.92, 4.75, 5.00, N~
$ review_scores_value <dbl> 4.00, 4.96, 4.88, 5.00, N~
$ license <lgl> NA, NA, NA, NA, NA, NA, N~
$ instant_bookable <lgl> FALSE, FALSE, FALSE, FALS~
$ calculated_host_listings_count <dbl> 1, 1, 1, 7, 7, 7, 7, 7, 1~
$ calculated_host_listings_count_entire_homes <dbl> 1, 1, 0, 0, 0, 0, 0, 0, 1~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 7, 7, 7, 7, 7, 0~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month <dbl> 0.01, 0.33, 0.17, 0.03, N~
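The output above shows that price is stored as a character string (for example "$4,930.00"), so it has to be converted to a number before it can be used in a model. A minimal sketch of one way to do this in base R follows. The vector `price_chr` is a hypothetical stand-in holding the two values visible above, not the full column.

```r
# Two price strings as they appear in the glimpse() output
price_chr <- c("$4,930.00", "$6,408.00")

# Remove the currency symbol and the thousands separators, then convert
price_num <- as.numeric(gsub("[$,]", "", price_chr))

price_num
```

The same pattern could be applied to the whole column, e.g. inside `mutate()`; `readr::parse_number()` is an alternative that handles the symbol and separators in one call.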
Then, let’s take a look at the summary statistics of the dataframe.
# Getting summary statistics of the dataframe
skimmed <- skim(listings)
# Using kable to make the table cleaner
kbl(skimmed) %>%
  kable_classic(full_width = FALSE, html_font = "Cambria")

| skim_type | skim_variable | n_missing | complete_rate | character.min | character.max | character.empty | character.n_unique | character.whitespace | Date.min | Date.max | Date.median | Date.n_unique | logical.mean | logical.count | numeric.mean | numeric.sd | numeric.p0 | numeric.p25 | numeric.p50 | numeric.p75 | numeric.p100 | numeric.hist |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| character | listing_url | 0 | 1.0000000 | 33 | 37 | 0 | 18438 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | name | 5 | 0.9997288 | 1 | 244 | 0 | 17864 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | description | 823 | 0.9553639 | 1 | 1000 | 0 | 16957 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighborhood_overview | 7166 | 0.6113461 | 1 | 1000 | 0 | 9920 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | picture_url | 0 | 1.0000000 | 60 | 126 | 0 | 17995 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_url | 0 | 1.0000000 | 38 | 43 | 0 | 11412 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_name | 5 | 0.9997288 | 1 | 34 | 0 | 3139 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_location | 85 | 0.9953900 | 2 | 204 | 0 | 957 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_about | 7875 | 0.5728929 | 1 | 4037 | 0 | 5547 | 24 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_time | 5 | 0.9997288 | 3 | 18 | 0 | 5 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_response_rate | 5 | 0.9997288 | 2 | 4 | 0 | 66 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_acceptance_rate | 5 | 0.9997288 | 2 | 4 | 0 | 87 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_thumbnail_url | 5 | 0.9997288 | 55 | 106 | 0 | 11326 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_picture_url | 5 | 0.9997288 | 57 | 109 | 0 | 11326 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_neighbourhood | 3214 | 0.8256861 | 4 | 33 | 0 | 105 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | host_verifications | 0 | 1.0000000 | 2 | 180 | 0 | 343 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood | 7166 | 0.6113461 | 9 | 93 | 0 | 739 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | neighbourhood_cleansed | 0 | 1.0000000 | 4 | 17 | 0 | 48 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | property_type | 0 | 1.0000000 | 3 | 35 | 0 | 69 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | room_type | 0 | 1.0000000 | 10 | 15 | 0 | 4 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | bathrooms_text | 74 | 0.9959865 | 6 | 17 | 0 | 44 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | amenities | 0 | 1.0000000 | 2 | 1517 | 0 | 17064 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| character | price | 0 | 1.0000000 | 5 | 13 | 0 | 2142 | 0 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | last_scraped | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2021-09-28 | 2021-10-06 | 2021-09-29 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | host_since | 5 | 0.9997288 | NA | NA | NA | NA | NA | 2008-08-29 | 2021-09-27 | 2016-04-06 | 3389 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | calendar_last_scraped | 0 | 1.0000000 | NA | NA | NA | NA | NA | 2021-09-28 | 2021-10-06 | 2021-09-29 | 4 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | first_review | 5994 | 0.6749105 | NA | NA | NA | NA | NA | 2010-10-19 | 2021-09-27 | 2019-01-11 | 2699 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| Date | last_review | 5994 | 0.6749105 | NA | NA | NA | NA | NA | 2010-12-30 | 2021-09-29 | 2019-12-06 | 2066 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_is_superhost | 5 | 0.9997288 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.2540552 | FAL: 13750, TRU: 4683 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_has_profile_pic | 5 | 0.9997288 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.9952802 | TRU: 18346, FAL: 87 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | host_identity_verified | 5 | 0.9997288 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.6693973 | TRU: 12339, FAL: 6094 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | neighbourhood_group_cleansed | 18438 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | bathrooms | 18438 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | calendar_updated | 18438 | 0.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | : | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | has_availability | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.9981560 | TRU: 18404, FAL: 34 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | license | 18437 | 0.0000542 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.0000000 | FAL: 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| logical | instant_bookable | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0.3884912 | FAL: 11275, TRU: 7163 | NA | NA | NA | NA | NA | NA | NA | NA |
| numeric | id | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.903211e+07 | 1.446644e+07 | 6.283000e+03 | 1.866906e+07 | 3.195715e+07 | 4.034960e+07 | 5.251016e+07 | ▃▃▅▇▅ |
| numeric | scrape_id | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.021093e+13 | 0.000000e+00 | 2.021093e+13 | 2.021093e+13 | 2.021093e+13 | 2.021093e+13 | 2.021093e+13 | ▁▁▇▁▁ |
| numeric | host_id | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.094351e+08 | 1.091353e+08 | 2.616000e+03 | 1.382351e+07 | 6.577595e+07 | 1.912728e+08 | 4.248618e+08 | ▇▂▂▂▁ |
| numeric | host_listings_count | 5 | 0.9997288 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.490696e+00 | 1.921978e+01 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 5.000000e+00 | 1.800000e+02 | ▇▁▁▁▁ |
| numeric | host_total_listings_count | 5 | 0.9997288 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.490696e+00 | 1.921978e+01 | 0.000000e+00 | 1.000000e+00 | 2.000000e+00 | 5.000000e+00 | 1.800000e+02 | ▇▁▁▁▁ |
| numeric | latitude | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | -3.459238e+01 | 1.796680e-02 | -3.468962e+01 | -3.460294e+01 | -3.459159e+01 | -3.458205e+01 | -3.453498e+01 | ▁▁▅▇▁ |
| numeric | longitude | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | -5.841514e+01 | 2.948310e-02 | -5.853093e+01 | -5.843450e+01 | -5.841437e+01 | -5.839106e+01 | -5.835541e+01 | ▁▁▆▇▅ |
| numeric | accommodates | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.793470e+00 | 1.529581e+00 | 0.000000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▂▁▁▁ |
| numeric | bedrooms | 2820 | 0.8470550 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.351774e+00 | 9.335122e-01 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.100000e+01 | ▇▁▁▁▁ |
| numeric | beds | 208 | 0.9887189 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.901481e+00 | 1.786956e+00 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 9.000000e+01 | ▇▁▁▁▁ |
| numeric | minimum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 7.136891e+00 | 2.102563e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 7.300000e+02 | ▇▁▁▁▁ |
| numeric | maximum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 6.638437e+02 | 9.239233e+02 | 1.000000e+00 | 9.000000e+01 | 1.125000e+03 | 1.125000e+03 | 9.999900e+04 | ▇▁▁▁▁ |
| numeric | minimum_minimum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 7.044582e+00 | 2.078429e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 7.300000e+02 | ▇▁▁▁▁ |
| numeric | maximum_minimum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 7.197364e+00 | 2.088650e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 7.300000e+02 | ▇▁▁▁▁ |
| numeric | minimum_maximum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 8.064373e+02 | 5.331394e+02 | 1.000000e+00 | 1.800000e+02 | 1.125000e+03 | 1.125000e+03 | 3.000000e+04 | ▇▁▁▁▁ |
| numeric | maximum_maximum_nights | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.666902e+05 | 3.162769e+07 | 1.000000e+00 | 1.820000e+02 | 1.125000e+03 | 1.125000e+03 | 2.147484e+09 | ▇▁▁▁▁ |
| numeric | minimum_nights_avg_ntm | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 7.116434e+00 | 2.080876e+01 | 1.000000e+00 | 2.000000e+00 | 3.000000e+00 | 5.000000e+00 | 7.300000e+02 | ▇▁▁▁▁ |
| numeric | maximum_nights_avg_ntm | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 3.408627e+05 | 2.308562e+07 | 1.000000e+00 | 1.810000e+02 | 1.125000e+03 | 1.125000e+03 | 1.573841e+09 | ▇▁▁▁▁ |
| numeric | availability_30 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.002300e+01 | 1.232825e+01 | 0.000000e+00 | 6.000000e+00 | 2.800000e+01 | 3.000000e+01 | 3.000000e+01 | ▃▁▁▁▇ |
| numeric | availability_60 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.308358e+01 | 2.316322e+01 | 0.000000e+00 | 2.700000e+01 | 5.800000e+01 | 6.000000e+01 | 6.000000e+01 | ▂▁▁▁▇ |
| numeric | availability_90 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 6.745482e+01 | 3.260183e+01 | 0.000000e+00 | 5.600000e+01 | 8.800000e+01 | 9.000000e+01 | 9.000000e+01 | ▂▁▁▁▇ |
| numeric | availability_365 | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2.355917e+02 | 1.248700e+02 | 0.000000e+00 | 1.030000e+02 | 2.690000e+02 | 3.640000e+02 | 3.650000e+02 | ▂▃▃▂▇ |
| numeric | number_of_reviews | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.612469e+01 | 3.375891e+01 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 1.600000e+01 | 5.040000e+02 | ▇▁▁▁▁ |
| numeric | number_of_reviews_ltm | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.307246e+00 | 4.495545e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 9.800000e+01 | ▇▁▁▁▁ |
| numeric | number_of_reviews_l30d | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1.308168e-01 | 5.903010e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.200000e+01 | ▇▁▁▁▁ |
| numeric | review_scores_rating | 5994 | 0.6749105 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.624139e+00 | 8.416064e-01 | 0.000000e+00 | 4.650000e+00 | 4.860000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_accuracy | 6276 | 0.6596160 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.797227e+00 | 4.507652e-01 | 0.000000e+00 | 4.780000e+00 | 4.940000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_cleanliness | 6277 | 0.6595618 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.678515e+00 | 5.143728e-01 | 0.000000e+00 | 4.590000e+00 | 4.830000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_checkin | 6276 | 0.6596160 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.873893e+00 | 3.856636e-01 | 0.000000e+00 | 4.890000e+00 | 5.000000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_communication | 6276 | 0.6596160 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.859365e+00 | 4.005052e-01 | 1.000000e+00 | 4.880000e+00 | 5.000000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_location | 6277 | 0.6595618 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.877528e+00 | 3.287340e-01 | 1.000000e+00 | 4.880000e+00 | 5.000000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | review_scores_value | 6280 | 0.6593991 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 4.693264e+00 | 4.856298e-01 | 0.000000e+00 | 4.620000e+00 | 4.820000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| numeric | calculated_host_listings_count | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 7.811259e+00 | 1.841884e+01 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.370000e+02 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_entire_homes | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 7.028691e+00 | 1.833470e+01 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 3.000000e+00 | 1.370000e+02 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_private_rooms | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 5.451242e-01 | 1.760177e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 2.100000e+01 | ▇▁▁▁▁ |
| numeric | calculated_host_listings_count_shared_rooms | 0 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 6.996420e-02 | 6.805771e-01 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.700000e+01 | ▇▁▁▁▁ |
| numeric | reviews_per_month | 5994 | 0.6749105 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 6.032682e-01 | 7.558439e-01 | 1.000000e-02 | 1.100000e-01 | 3.200000e-01 | 8.100000e-01 | 8.110000e+00 | ▇▁▁▁▁ |
There are 18,438 observations and 74 variables in the dataframe. Among these, 23 variables are in character format, 5 are in date format, 9 are in logical format, and 37 are in numeric format.
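These per-type counts can also be obtained directly from the dataframe rather than read off the skim table. The sketch below shows one way to do it on a small, hypothetical dataframe `toy`. Its column names mirror some of those in listings, but its values are made up for illustration.

```r
# Hypothetical toy dataframe with one column of each type
toy <- data.frame(
  name = c("Listing A", "Listing B"),
  host_since = as.Date(c("2009-04-13", "2009-10-01")),
  host_is_superhost = c(FALSE, TRUE),
  accommodates = c(2, 2),
  stringsAsFactors = FALSE
)

# Take the first class of every column, then tabulate how often each occurs
type_counts <- table(vapply(toy, function(col) class(col)[1], character(1)))

type_counts
```

Replacing `toy` with `listings` should give the breakdown by type for the full dataset.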
The following variables are in numeric format:
# Finding the columns which are numeric
nums <- unlist(lapply(listings, is.numeric))
# Listing the names of the numeric variables only
kbl(colnames(listings[, nums])) %>%
  kable_classic(full_width = FALSE, html_font = "Cambria")

| x |
|---|
| id |
| scrape_id |
| host_id |
| host_listings_count |
| host_total_listings_count |
| latitude |
| longitude |
| accommodates |
| bedrooms |
| beds |
| minimum_nights |
| maximum_nights |
| minimum_minimum_nights |
| maximum_minimum_nights |
| minimum_maximum_nights |
| maximum_maximum_nights |
| minimum_nights_avg_ntm |
| maximum_nights_avg_ntm |
| availability_30 |
| availability_60 |
| availability_90 |
| availability_365 |
| number_of_reviews |
| number_of_reviews_ltm |
| number_of_reviews_l30d |
| review_scores_rating |
| review_scores_accuracy |
| review_scores_cleanliness |
| review_scores_checkin |
| review_scores_communication |
| review_scores_location |
| review_scores_value |
| calculated_host_listings_count |
| calculated_host_listings_count_entire_homes |
| calculated_host_listings_count_private_rooms |
| calculated_host_listings_count_shared_rooms |
| reviews_per_month |
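The following variables are stored as character strings. Some of them, such as property_type and room_type, are categorical (factor) variables, i.e. variables with a fixed and known set of possible values. Others, such as price, are really numeric and need to be converted before use.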
# Flagging the columns which are stored as character
fact <- unlist(lapply(listings, is.character))
# Listing the names of the character variables only
kbl(colnames(listings[, fact])) %>%
  kable_classic(full_width = FALSE, html_font = "Cambria")

| x |
|---|
| listing_url |
| name |
| description |
| neighborhood_overview |
| picture_url |
| host_url |
| host_name |
| host_location |
| host_about |
| host_response_time |
| host_response_rate |
| host_acceptance_rate |
| host_thumbnail_url |
| host_picture_url |
| host_neighbourhood |
| host_verifications |
| neighbourhood |
| neighbourhood_cleansed |
| property_type |
| room_type |
| bathrooms_text |
| amenities |
| price |
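
When correlating the numeric columns, note that `cor()` with its default settings returns NA for every pair that involves a column containing missing values, such as bedrooms or the review scores. A constant column such as scrape_id (standard deviation 0) also gives NA, whatever the setting. The sketch below, on a hypothetical toy dataframe, shows how the `use = "pairwise.complete.obs"` argument of `cor()` computes each correlation from the rows where both variables are present. Whether pairwise deletion is appropriate for the analysis is a judgment call, not something the report specifies.

```r
# Hypothetical toy data with one missing value in bedrooms
toy <- data.frame(
  accommodates = c(2, 2, 1, 4, 3),
  bedrooms     = c(NA, 1, 1, 2, 1),
  beds         = c(1, 1, 1, 3, 2)
)

# Default behaviour: any pair involving bedrooms is NA
cor_default <- cor(toy)

# Pairwise deletion: each pair uses the rows where both values are observed
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_default
cor_pairwise
```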
# Showing the correlation between all the numeric variables in the dataframe
kbl(cor(listings[, nums])) %>%
  kable_classic(full_width = FALSE, html_font = "Cambria")

| | id | scrape_id | host_id | host_listings_count | host_total_listings_count | latitude | longitude | accommodates | bedrooms | beds | minimum_nights | maximum_nights | minimum_minimum_nights | maximum_minimum_nights | minimum_maximum_nights | maximum_maximum_nights | minimum_nights_avg_ntm | maximum_nights_avg_ntm | availability_30 | availability_60 | availability_90 | availability_365 | number_of_reviews | number_of_reviews_ltm | number_of_reviews_l30d | review_scores_rating | review_scores_accuracy | review_scores_cleanliness | review_scores_checkin | review_scores_communication | review_scores_location | review_scores_value | calculated_host_listings_count | calculated_host_listings_count_entire_homes | calculated_host_listings_count_private_rooms | calculated_host_listings_count_shared_rooms | reviews_per_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| id | 1.0000000 | NA | 0.4829823 | NA | NA | 0.0144202 | -0.0368658 | 0.0256595 | NA | NA | -0.0326757 | -0.0260183 | -0.0321714 | -0.0349780 | 0.0426763 | 0.0167889 | -0.0332481 | 0.0167893 | -0.0163400 | -0.0099564 | -0.0069541 | -0.1354264 | -0.3550375 | 0.0552288 | 0.0648765 | NA | NA | NA | NA | NA | NA | NA | 0.0638544 | 0.0683429 | -0.0536755 | 0.0093959 | NA |
| scrape_id | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| host_id | 0.4829823 | NA | 1.0000000 | NA | NA | -0.0908530 | 0.0046427 | -0.0320289 | NA | NA | -0.0424129 | -0.0387636 | -0.0436154 | -0.0456054 | -0.0019859 | 0.0343178 | -0.0444780 | 0.0343177 | 0.0758486 | 0.0701194 | 0.0622490 | -0.0677422 | -0.1804557 | -0.0106043 | 0.0017619 | NA | NA | NA | NA | NA | NA | NA | -0.1879010 | -0.1863639 | -0.0106754 | 0.0351991 | NA |
| host_listings_count | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| host_total_listings_count | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| latitude | 0.0144202 | NA | -0.0908530 | NA | NA | 1.0000000 | -0.5372205 | 0.0139465 | NA | NA | 0.0419546 | 0.0037940 | 0.0423345 | 0.0417211 | 0.0167720 | 0.0005743 | 0.0425886 | 0.0005749 | -0.1054029 | -0.0954205 | -0.0911168 | -0.0358769 | 0.0214887 | 0.0637330 | 0.0492606 | NA | NA | NA | NA | NA | NA | NA | 0.0541229 | 0.0686779 | -0.1154082 | -0.0462795 | NA |
| longitude | -0.0368658 | NA | 0.0046427 | NA | NA | -0.5372205 | 1.0000000 | 0.0524208 | NA | NA | -0.0213042 | 0.0168961 | -0.0228477 | -0.0220538 | 0.0309381 | 0.0120328 | -0.0230126 | 0.0120334 | 0.0766677 | 0.0665711 | 0.0636355 | 0.0299735 | 0.0441809 | -0.0274220 | -0.0161340 | NA | NA | NA | NA | NA | NA | NA | 0.0304587 | 0.0210860 | 0.0402895 | 0.0315477 | NA |
| accommodates | 0.0256595 | NA | -0.0320289 | NA | NA | 0.0139465 | 0.0524208 | 1.0000000 | NA | NA | -0.0035614 | 0.0374635 | -0.0062106 | -0.0062365 | 0.0691408 | 0.0019902 | -0.0060214 | 0.0019776 | -0.0707786 | -0.0636930 | -0.0609931 | -0.0186892 | 0.0503732 | 0.0416317 | 0.0377335 | NA | NA | NA | NA | NA | NA | NA | 0.0652512 | 0.0802937 | -0.0921696 | -0.0385339 | NA |
| bedrooms | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| beds | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| minimum_nights | -0.0326757 | NA | -0.0424129 | NA | NA | 0.0419546 | -0.0213042 | -0.0035614 | NA | NA | 1.0000000 | 0.0049291 | 0.9840816 | 0.9831987 | -0.0043955 | -0.0042997 | 0.9851330 | -0.0042997 | 0.0116827 | 0.0060989 | 0.0030280 | 0.0416406 | -0.0522061 | -0.0532184 | -0.0466206 | NA | NA | NA | NA | NA | NA | NA | 0.0184966 | 0.0218590 | -0.0237698 | -0.0104447 | NA |
| maximum_nights | -0.0260183 | NA | -0.0387636 | NA | NA | 0.0037940 | 0.0168961 | 0.0374635 | NA | NA | 0.0049291 | 1.0000000 | 0.0050556 | 0.0056016 | 0.4818919 | 0.0073608 | 0.0052267 | 0.0073638 | -0.0327755 | -0.0348187 | -0.0338110 | 0.0498394 | -0.0029126 | 0.0033501 | 0.0051649 | NA | NA | NA | NA | NA | NA | NA | 0.0787046 | 0.0756952 | 0.0087460 | -0.0149564 | NA |
| minimum_minimum_nights | -0.0321714 | NA | -0.0436154 | NA | NA | 0.0423345 | -0.0228477 | -0.0062106 | NA | NA | 0.9840816 | 0.0050556 | 1.0000000 | 0.9957239 | -0.0042097 | -0.0042842 | 0.9989711 | -0.0042842 | 0.0142509 | 0.0078667 | 0.0045909 | 0.0432534 | -0.0539792 | -0.0541510 | -0.0472828 | NA | NA | NA | NA | NA | NA | NA | 0.0197345 | 0.0229857 | -0.0229149 | -0.0101708 | NA |
| maximum_minimum_nights | -0.0349780 | NA | -0.0456054 | NA | NA | 0.0417211 | -0.0220538 | -0.0062365 | NA | NA | 0.9831987 | 0.0056016 | 0.9957239 | 1.0000000 | -0.0039128 | -0.0043710 | 0.9983256 | -0.0043710 | 0.0094797 | 0.0044184 | 0.0016244 | 0.0423954 | -0.0531178 | -0.0529026 | -0.0461870 | NA | NA | NA | NA | NA | NA | NA | 0.0198667 | 0.0228781 | -0.0198810 | -0.0106594 | NA |
| minimum_maximum_nights | 0.0426763 | NA | -0.0019859 | NA | NA | 0.0167720 | 0.0309381 | 0.0691408 | NA | NA | -0.0043955 | 0.4818919 | -0.0042097 | -0.0039128 | 1.0000000 | -0.0214643 | -0.0039425 | -0.0214580 | -0.0735273 | -0.0719075 | -0.0665879 | 0.0357631 | 0.0461889 | 0.0643887 | 0.0508856 | NA | NA | NA | NA | NA | NA | NA | 0.0763266 | 0.0742032 | 0.0096837 | -0.0457397 | NA |
| maximum_maximum_nights | 0.0167889 | NA | 0.0343178 | NA | NA | 0.0005743 | 0.0120328 | 0.0019902 | NA | NA | -0.0042997 | 0.0073608 | -0.0042842 | -0.0043710 | -0.0214643 | 1.0000000 | -0.0043300 | 0.9999973 | -0.0060032 | -0.0083219 | -0.0101472 | 0.0026441 | -0.0070354 | -0.0042825 | -0.0032636 | NA | NA | NA | NA | NA | NA | NA | -0.0030469 | -0.0056459 | 0.0289141 | -0.0015151 | NA |
| minimum_nights_avg_ntm | -0.0332481 | NA | -0.0444780 | NA | NA | 0.0425886 | -0.0230126 | -0.0060214 | NA | NA | 0.9851330 | 0.0052267 | 0.9989711 | 0.9983256 | -0.0039425 | -0.0043300 | 1.0000000 | -0.0043301 | 0.0119852 | 0.0059573 | 0.0027342 | 0.0429191 | -0.0534382 | -0.0532171 | -0.0468338 | NA | NA | NA | NA | NA | NA | NA | 0.0198302 | 0.0230716 | -0.0224429 | -0.0104778 | NA |
| maximum_nights_avg_ntm | 0.0167893 | NA | 0.0343177 | NA | NA | 0.0005749 | 0.0120334 | 0.0019776 | NA | NA | -0.0042997 | 0.0073638 | -0.0042842 | -0.0043710 | -0.0214580 | 0.9999973 | -0.0043301 | 1.0000000 | -0.0059794 | -0.0082965 | -0.0101201 | 0.0026584 | -0.0070350 | -0.0042820 | -0.0032633 | NA | NA | NA | NA | NA | NA | NA | -0.0030464 | -0.0056455 | 0.0289141 | -0.0015154 | NA |
| availability_30 | -0.0163400 | NA | 0.0758486 | NA | NA | -0.1054029 | 0.0766677 | -0.0707786 | NA | NA | 0.0116827 | -0.0327755 | 0.0142509 | 0.0094797 | -0.0735273 | -0.0060032 | 0.0119852 | -0.0059794 | 1.0000000 | 0.9566171 | 0.9106184 | 0.3732122 | -0.1802830 | -0.1525458 | -0.0922873 | NA | NA | NA | NA | NA | NA | NA | -0.0527074 | -0.0644003 | 0.0889616 | -0.0113171 | NA |
| availability_60 | -0.0099564 | NA | 0.0701194 | NA | NA | -0.0954205 | 0.0665711 | -0.0636930 | NA | NA | 0.0060989 | -0.0348187 | 0.0078667 | 0.0044184 | -0.0719075 | -0.0083219 | 0.0059573 | -0.0082965 | 0.9566171 | 1.0000000 | 0.9794879 | 0.4035407 | -0.1658692 | -0.1268526 | -0.0734154 | NA | NA | NA | NA | NA | NA | NA | -0.0305657 | -0.0397934 | 0.0771196 | -0.0228552 | NA |
| availability_90 | -0.0069541 | NA | 0.0622490 | NA | NA | -0.0911168 | 0.0636355 | -0.0609931 | NA | NA | 0.0030280 | -0.0338110 | 0.0045909 | 0.0016244 | -0.0665879 | -0.0101472 | 0.0027342 | -0.0101201 | 0.9106184 | 0.9794879 | 1.0000000 | 0.4245896 | -0.1551103 | -0.1153777 | -0.0642781 | NA | NA | NA | NA | NA | NA | NA | -0.0129807 | -0.0204452 | 0.0684765 | -0.0316898 | NA |
| availability_365 | -0.1354264 | NA | -0.0677422 | NA | NA | -0.0358769 | 0.0299735 | -0.0186892 | NA | NA | 0.0416406 | 0.0498394 | 0.0432534 | 0.0423954 | 0.0357631 | 0.0026441 | 0.0429191 | 0.0026584 | 0.3732122 | 0.4035407 | 0.4245896 | 1.0000000 | -0.0533246 | -0.1014549 | -0.0689872 | NA | NA | NA | NA | NA | NA | NA | 0.0538917 | 0.0411271 | 0.0617789 | 0.0314637 | NA |
| number_of_reviews | -0.3550375 | NA | -0.1804557 | NA | NA | 0.0214887 | 0.0441809 | 0.0503732 | NA | NA | -0.0522061 | -0.0029126 | -0.0539792 | -0.0531178 | 0.0461889 | -0.0070354 | -0.0534382 | -0.0070350 | -0.1802830 | -0.1658692 | -0.1551103 | -0.0533246 | 1.0000000 | 0.3348992 | 0.2387524 | NA | NA | NA | NA | NA | NA | NA | -0.0554002 | -0.0450813 | -0.0631179 | -0.0409227 | NA |
| number_of_reviews_ltm | 0.0552288 | NA | -0.0106043 | NA | NA | 0.0637330 | -0.0274220 | 0.0416317 | NA | NA | -0.0532184 | 0.0033501 | -0.0541510 | -0.0529026 | 0.0643887 | -0.0042825 | -0.0532171 | -0.0042820 | -0.1525458 | -0.1268526 | -0.1153777 | -0.1014549 | 0.3348992 | 1.0000000 | 0.7061156 | NA | NA | NA | NA | NA | NA | NA | 0.0530862 | 0.0624504 | -0.0651933 | -0.0274130 | NA |
| number_of_reviews_l30d | 0.0648765 | NA | 0.0017619 | NA | NA | 0.0492606 | -0.0161340 | 0.0377335 | NA | NA | -0.0466206 | 0.0051649 | -0.0472828 | -0.0461870 | 0.0508856 | -0.0032636 | -0.0468338 | -0.0032633 | -0.0922873 | -0.0734154 | -0.0642781 | -0.0689872 | 0.2387524 | 0.7061156 | 1.0000000 | NA | NA | NA | NA | NA | NA | NA | 0.0271289 | 0.0345929 | -0.0511487 | -0.0207579 | NA |
| review_scores_rating | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| review_scores_accuracy | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| review_scores_cleanliness | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| review_scores_checkin | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA | NA |
| review_scores_communication | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA | NA |
| review_scores_location | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA | NA |
| review_scores_value | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | NA | NA | NA | NA | NA |
| calculated_host_listings_count | 0.0638544 | NA | -0.1879010 | NA | NA | 0.0541229 | 0.0304587 | 0.0652512 | NA | NA | 0.0184966 | 0.0787046 | 0.0197345 | 0.0198667 | 0.0763266 | -0.0030469 | 0.0198302 | -0.0030464 | -0.0527074 | -0.0305657 | -0.0129807 | 0.0538917 | -0.0554002 | 0.0530862 | 0.0271289 | NA | NA | NA | NA | NA | NA | NA | 1.0000000 | 0.9862029 | 0.0262926 | 0.0333663 | NA |
| calculated_host_listings_count_entire_homes | 0.0683429 | NA | -0.1863639 | NA | NA | 0.0686779 | 0.0210860 | 0.0802937 | NA | NA | 0.0218590 | 0.0756952 | 0.0229857 | 0.0228781 | 0.0742032 | -0.0056459 | 0.0230716 | -0.0056455 | -0.0644003 | -0.0397934 | -0.0204452 | 0.0411271 | -0.0450813 | 0.0624504 | 0.0345929 | NA | NA | NA | NA | NA | NA | NA | 0.9862029 | 1.0000000 | -0.0708269 | -0.0379510 | NA |
| calculated_host_listings_count_private_rooms | -0.0536755 | NA | -0.0106754 | NA | NA | -0.1154082 | 0.0402895 | -0.0921696 | NA | NA | -0.0237698 | 0.0087460 | -0.0229149 | -0.0198810 | 0.0096837 | 0.0289141 | -0.0224429 | 0.0289141 | 0.0889616 | 0.0771196 | 0.0684765 | 0.0617789 | -0.0631179 | -0.0651933 | -0.0511487 | NA | NA | NA | NA | NA | NA | NA | 0.0262926 | -0.0708269 | 1.0000000 | 0.0823490 | NA |
| calculated_host_listings_count_shared_rooms | 0.0093959 | NA | 0.0351991 | NA | NA | -0.0462795 | 0.0315477 | -0.0385339 | NA | NA | -0.0104447 | -0.0149564 | -0.0101708 | -0.0106594 | -0.0457397 | -0.0015151 | -0.0104778 | -0.0015154 | -0.0113171 | -0.0228552 | -0.0316898 | 0.0314637 | -0.0409227 | -0.0274130 | -0.0207579 | NA | NA | NA | NA | NA | NA | NA | 0.0333663 | -0.0379510 | 0.0823490 | 1.0000000 | NA |
| reviews_per_month | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 |
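The `NA` entries in the matrix above are consistent with how base R's `cor()` behaves by default: any pair involving a column with missing values is returned as `NA`. A minimal sketch, assuming a small made-up data frame `toy` (not the listings data), of how `use = "pairwise.complete.obs"` could recover those correlations:

```r
# Hypothetical toy data: `rating` has missing values, as review scores often do
toy <- data.frame(
  price  = c(100, 150, 200, 250, 300),
  beds   = c(1, 2, 2, 3, 4),
  rating = c(4.5, NA, 4.8, 4.9, NA)
)

# Default (use = "everything"): any pair involving a column with NAs is NA
cor_default <- cor(toy)

# Pairwise deletion: each correlation uses the rows where both columns are observed
cor_pairwise <- cor(toy, use = "pairwise.complete.obs")

cor_default["price", "rating"]
cor_pairwise["price", "rating"]
```

With pairwise deletion, each cell of the matrix may be based on a different number of rows, so the values should be read with that in mind.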
Since the original dataframe contains too many variables to plot at once, we hand-selected the variables that our group members think people pay most attention to when booking an AirBnB, and made a scatterplot matrix.
#to plot scatterplot matrix
matrix <- ggpairs(listings[,c("accommodates","bedrooms","minimum_nights","maximum_nights", "number_of_reviews", "review_scores_rating", "reviews_per_month", "review_scores_accuracy" ,"host_identity_verified","instant_bookable", "host_response_time", "room_type")])
matrix
Accommodates and number of bedrooms, number of reviews and reviews per month, and review score accuracy and review score rating have a strong correlation.
We noticed that some of the price data (price) is given as a character string, e.g., “$176.00”. Since price is a quantitative variable, we need to make sure it is stored as numeric data in the dataframe.
listings <- listings %>%
#to change the price variable into numeric format
mutate(price = parse_number(price))
#to check that the changes are implemented
typeof(listings$price)
[1] "double"
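For readers without readr at hand, the same conversion can be sketched in base R. The strings below are hypothetical examples in the format described above, and the regular expression is an assumption about which characters need stripping (currency symbol and thousands separators):

```r
# Hypothetical price strings in the format described above
price_chr <- c("$176.00", "$1,250.00", "$89.50")

# Strip the currency symbol and thousands separators, then convert to numeric
price_num <- as.numeric(gsub("[$,]", "", price_chr))

price_num
typeof(price_num)
```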
There are too many property types, so we want to simplify them. The top 4 property types are: Entire rental unit, Private room in rental unit, Entire condominium (condo) and Private room in residential home.
# count the number of listings that fall under one of the 4 top property types
count_top_4_prop_types <- listings %>%
filter(property_type %in% c("Entire rental unit",
"Private room in rental unit",
"Entire condominium (condo)",
"Private room in residential home")) %>%
summarize(sum = n())
# count the total number of listings
total_listing_num <- listings %>%
summarize(sum = n())
# share of listings that fall under the top 4 property types
count_top_4_prop_types/total_listing_num
| sum |
|---|
| 0.867 |
Since the vast majority of the observations (86.7%) fall under one of the top four property types, we would like to create a simplified version of the property_type variable named prop_type_simplified that has 5 categories: the top four categories and Other.
listings <- listings %>%
# to create a new variable named prop_type_simplified
mutate(prop_type_simplified = case_when(
# to set property types which are not among the top 4 types to "Other"
property_type %in% c("Entire rental unit", "Private room in rental unit","Entire condominium (condo)","Private room in residential home") ~ property_type,
TRUE ~ "Other"
))
We then used the code below to confirm that prop_type_simplified was created correctly.
listings %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))
| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 12395 |
| Private room in rental unit | Private room in rental unit | 2042 |
| Entire condominium (condo) | Entire condominium (condo) | 784 |
| Private room in residential home | Private room in residential home | 771 |
| Entire loft | Other | 600 |
| Entire residential home | Other | 337 |
| Entire serviced apartment | Other | 295 |
| Shared room in rental unit | Other | 207 |
| Private room in serviced apartment | Other | 88 |
| Shared room in residential home | Other | 83 |
| Private room in condominium (condo) | Other | 74 |
| Room in hotel | Other | 68 |
| Room in boutique hotel | Other | 65 |
| Private room in bed and breakfast | Other | 63 |
| Room in bed and breakfast | Other | 58 |
| Private room in loft | Other | 38 |
| Room in hostel | Other | 37 |
| Room in serviced apartment | Other | 37 |
| Private room in guest suite | Other | 36 |
| Entire guest suite | Other | 28 |
| Private room in casa particular | Other | 25 |
| Room in aparthotel | Other | 21 |
| Shared room in guesthouse | Other | 21 |
| Entire townhouse | Other | 20 |
| Private room in guesthouse | Other | 20 |
| Shared room in hostel | Other | 18 |
| Private room in hostel | Other | 17 |
| Private room | Other | 16 |
| Entire place | Other | 15 |
| Entire guesthouse | Other | 14 |
| Casa particular | Other | 12 |
| Private room in villa | Other | 12 |
| Private room in townhouse | Other | 11 |
| Shared room in bed and breakfast | Other | 10 |
| Shared room in condominium (condo) | Other | 9 |
| Tiny house | Other | 9 |
| Entire villa | Other | 8 |
| Shared room in loft | Other | 8 |
| Shared room in serviced apartment | Other | 7 |
| Shared room in villa | Other | 7 |
| Camper/RV | Other | 5 |
| Private room in chalet | Other | 4 |
| Private room in tiny house | Other | 4 |
| Entire cabin | Other | 3 |
| Entire home/apt | Other | 3 |
| Shared room | Other | 3 |
| Boat | Other | 2 |
| Earth house | Other | 2 |
| Entire cottage | Other | 2 |
| Private room in cabin | Other | 2 |
| Shared room in boutique hotel | Other | 2 |
| Shared room in guest suite | Other | 2 |
| Shared room in townhouse | Other | 2 |
| Campsite | Other | 1 |
| Car | Other | 1 |
| Cycladic house | Other | 1 |
| Entire bed and breakfast | Other | 1 |
| Entire in-law | Other | 1 |
| Floor | Other | 1 |
| Pension | Other | 1 |
| Private room in boat | Other | 1 |
| Private room in castle | Other | 1 |
| Private room in dome house | Other | 1 |
| Private room in dorm | Other | 1 |
| Private room in farm stay | Other | 1 |
| Private room in floor | Other | 1 |
| Private room in in-law | Other | 1 |
| Private room in resort | Other | 1 |
| Treehouse | Other | 1 |
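The four top property types were typed out by hand above. As a hedged alternative, they could be derived from the counts themselves; the sketch below uses a made-up `property_type` vector rather than the listings data, and the name `top4` is introduced only for illustration:

```r
# Hypothetical property types; the real column is listings$property_type
property_type <- c(
  rep("Entire rental unit", 6),
  rep("Private room in rental unit", 4),
  rep("Entire condominium (condo)", 3),
  rep("Private room in residential home", 2),
  "Entire loft", "Boat"
)

# Names of the 4 most frequent categories
top4 <- names(sort(table(property_type), decreasing = TRUE))[1:4]

# Keep the top 4 as they are and collapse everything else into "Other"
prop_type_simplified <- ifelse(property_type %in% top4, property_type, "Other")

table(prop_type_simplified)
```

Deriving the categories from the data avoids retyping the labels if the ranking changes for another city.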
Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes.
We first tabulate minimum_nights, and then use a density plot to see which minimum stays are most common across the listings.
# count the listings with different minimum nights
min_nights<- table(listings$minimum_nights)
kbl(min_nights) %>%
kable_classic(full_width = F, html_font = "Cambria")
| Var1 | Freq |
|---|---|
| 1 | 4274 |
| 2 | 3767 |
| 3 | 3741 |
| 4 | 1237 |
| 5 | 1223 |
| 6 | 331 |
| 7 | 1630 |
| 8 | 12 |
| 9 | 18 |
| 10 | 239 |
| 11 | 2 |
| 12 | 20 |
| 13 | 6 |
| 14 | 181 |
| 15 | 338 |
| 16 | 3 |
| 17 | 1 |
| 18 | 3 |
| 19 | 5 |
| 20 | 112 |
| 21 | 27 |
| 22 | 2 |
| 24 | 2 |
| 25 | 29 |
| 26 | 3 |
| 27 | 7 |
| 28 | 168 |
| 29 | 36 |
| 30 | 626 |
| 31 | 14 |
| 40 | 11 |
| 45 | 5 |
| 50 | 3 |
| 55 | 1 |
| 58 | 1 |
| 60 | 80 |
| 61 | 1 |
| 65 | 1 |
| 71 | 1 |
| 75 | 1 |
| 79 | 1 |
| 80 | 3 |
| 85 | 1 |
| 89 | 2 |
| 90 | 144 |
| 92 | 1 |
| 100 | 8 |
| 112 | 1 |
| 120 | 36 |
| 130 | 6 |
| 150 | 2 |
| 175 | 1 |
| 179 | 1 |
| 180 | 34 |
| 200 | 5 |
| 240 | 1 |
| 300 | 4 |
| 359 | 4 |
| 360 | 4 |
| 365 | 15 |
| 500 | 1 |
| 730 | 1 |
#make a density plot of minimum nights of the listings
ggplot(listings,aes(x=minimum_nights))+
geom_density()+
theme_bw()+
labs (
title = "Density Plot of the Minimum Nights for all the Listings",
x = "Minimum Nights",
y = "Density"
)+
NULL
The most common values are 1, 2, 3, 7 and 4 nights, in that order.
The value of 7 nights stands out because it is more common than 4 nights. This may be because many people travel for a week at a time. Furthermore, many Airbnb hosts offer discounts for 7-night bookings, which leads to higher sales.
Since we are forecasting the cost of 2 people staying for 4 nights, we only keep listings with minimum_nights <= 4.
listings_min4 <- listings %>%
#keeping only listings with minimum nights less than or equal to 4
filter(minimum_nights <= 4)
Mapping the AirBnB locations to a map of Buenos Aires
# Creating a map with blue points marking each AirBnB location in Buenos Aires
leaflet(data = listings_min4) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)
We will use the cost for two people to stay at an Airbnb location for four (4) nights as our target variable \(Y\).
We created a new variable called price_4_nights that uses price, and accommodates to calculate the total cost for two people to stay at the Airbnb property for 4 nights. This is the variable \(Y\) we want to explain.
pricefor2 <- listings_min4 %>%
#keeping listings that accommodate at least 2 people and allow stays of at least 4 nights
filter(accommodates >= 2, maximum_nights >= 4) %>%
#create the price_4_nights variable
mutate(price_4_nights = (price*4))
We realized that there may be extremely high or low values in price_4_nights that could seriously affect our regression analysis, so we used histograms to examine the distributions of price_4_nights and log(price_4_nights) and decide which variable to use in the regression model.
#plot the histogram for `price_4_nights`
ggplot(pricefor2, aes(x = price_4_nights))+
geom_histogram()+
NULL
#plot the histogram for `log(price_4_nights)`
ggplot(pricefor2, aes(x = log(price_4_nights)))+
geom_histogram()+
NULL
The histograms show that log(price_4_nights) is the better choice for the regression analysis because the effect of extreme values is reduced, so we create a new variable as follows:
logpricefor2 <- pricefor2 %>%
# creating a new variable for the logarithm of price_4_nights
mutate(logprice4nights = log(price_4_nights))
For our initial model, we look at the effect of property type, number of reviews and review score rating on log(price_4_nights).
#Creating a log-linear regression model
model1 <- lm(logprice4nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating,
data= logpricefor2)
#displaying regression estimates
summary(model1)
Call:
lm(formula = logprice4nights ~ prop_type_simplified + number_of_reviews +
review_scores_rating, data = logpricefor2)
Residuals:
Min 1Q Median 3Q Max
-2.0701 -0.4468 -0.0944 0.3431 5.6667
Coefficients:
Estimate Std. Error
(Intercept) 9.6212151 0.0581534
prop_type_simplifiedEntire rental unit -0.0845437 0.0359128
prop_type_simplifiedOther 0.0554728 0.0410768
prop_type_simplifiedPrivate room in rental unit -0.6193214 0.0473098
prop_type_simplifiedPrivate room in residential home -0.7164308 0.0558697
number_of_reviews -0.0011363 0.0001735
review_scores_rating -0.0013568 0.0098422
t value Pr(>|t|)
(Intercept) 165.445 < 2e-16 ***
prop_type_simplifiedEntire rental unit -2.354 0.0186 *
prop_type_simplifiedOther 1.350 0.1769
prop_type_simplifiedPrivate room in rental unit -13.091 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -12.823 < 2e-16 ***
number_of_reviews -6.548 6.18e-11 ***
review_scores_rating -0.138 0.8904
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6627 on 8413 degrees of freedom
(3127 observations deleted due to missingness)
Multiple R-squared: 0.0623, Adjusted R-squared: 0.06163
F-statistic: 93.16 on 6 and 8413 DF, p-value: < 2.2e-16
Interpretation of coefficients (percentage effects are computed as 100 * (exp(coefficient) - 1), holding the other variables constant):
A 1 unit increase in review_scores_rating is associated with a 0.14% decrease in price_4_nights, although this coefficient is not statistically significant (p = 0.89)
A 1 unit increase in number of reviews is associated with a 0.11% decrease in price_4_nights
If the property type is entire rental unit, price_4_nights is 8.11% lower than for an entire condominium (the base category)
If the property type is other, price_4_nights is 5.70% higher than for an entire condominium, although this coefficient is not statistically significant (p = 0.18)
If the property type is private room in rental unit, price_4_nights is 46.17% lower than for an entire condominium
If the property type is private room in residential home, price_4_nights is 51.15% lower than for an entire condominium
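Because the response is log(price_4_nights), a coefficient b translates into a percentage change of 100 * (exp(b) - 1). A short sketch of that arithmetic, with the estimates copied from the model1 summary above (the vector name `coefs` is introduced here only for illustration):

```r
# Coefficients copied from the model1 summary above
coefs <- c(
  "Entire rental unit"               = -0.0845437,
  "Other"                            =  0.0554728,
  "Private room in rental unit"      = -0.6193214,
  "Private room in residential home" = -0.7164308,
  "number_of_reviews"                = -0.0011363,
  "review_scores_rating"             = -0.0013568
)

# In a log-linear model, a coefficient b corresponds to a 100 * (exp(b) - 1) percent change
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 2)
```

For small coefficients the percentage is close to 100 * b, but for the large property type effects the exponential form is the appropriate one.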
After creating the first model, we wanted to determine whether room_type is a significant predictor of the price for 4 nights. We decided to create a model with all the variables in model1 plus room_type.
#Creating a new logarithmic regression model with additional variables
model2 <- lm(logprice4nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
room_type,
data= logpricefor2)
#displaying regression estimates
summary(model2)
Call:
lm(formula = logprice4nights ~ prop_type_simplified + number_of_reviews +
review_scores_rating + room_type, data = logpricefor2)
Residuals:
Min 1Q Median 3Q Max
-2.2188 -0.4403 -0.0949 0.3396 5.6679
Coefficients:
Estimate Std. Error
(Intercept) 9.6878663 0.0571688
prop_type_simplifiedEntire rental unit -0.0862458 0.0352109
prop_type_simplifiedOther 0.2311889 0.0424067
prop_type_simplifiedPrivate room in rental unit -0.0419747 0.0743405
prop_type_simplifiedPrivate room in residential home -0.1384833 0.0798644
number_of_reviews -0.0012826 0.0001704
review_scores_rating -0.0143446 0.0096839
room_typeHotel room 0.0743577 0.0863216
room_typePrivate room -0.5824204 0.0582948
room_typeShared room -1.4370858 0.0878270
t value Pr(>|t|)
(Intercept) 169.461 < 2e-16 ***
prop_type_simplifiedEntire rental unit -2.449 0.0143 *
prop_type_simplifiedOther 5.452 5.13e-08 ***
prop_type_simplifiedPrivate room in rental unit -0.565 0.5723
prop_type_simplifiedPrivate room in residential home -1.734 0.0830 .
number_of_reviews -7.527 5.74e-14 ***
review_scores_rating -1.481 0.1386
room_typeHotel room 0.861 0.3890
room_typePrivate room -9.991 < 2e-16 ***
room_typeShared room -16.363 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6498 on 8410 degrees of freedom
(3127 observations deleted due to missingness)
Multiple R-squared: 0.09892, Adjusted R-squared: 0.09796
F-statistic: 102.6 on 9 and 8410 DF, p-value: < 2.2e-16
#Running an F test on model 2 and model 1
anova(model2,model1)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 8.41e+03 | 3.55e+03 | | | | |
| 8.41e+03 | 3.7e+03 | -3 | -144 | 114 | 2.54e-72 |
We ran an F-test on model 2 and model 1 and determined that room_type is a significant predictor of the cost for 4 nights (F = 114, p = 2.54e-72).
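The F-test above compares two nested models: one without and one with the extra factor. A minimal sketch on simulated data (all variable names and numbers here are made up for illustration) of the same comparison with `anova()`:

```r
set.seed(1)
n  <- 200
x1 <- rnorm(n)
g  <- factor(sample(c("a", "b", "c"), n, replace = TRUE))
y  <- 1 + 0.5 * x1 + ifelse(g == "c", 1, 0) + rnorm(n)

small <- lm(y ~ x1)        # restricted model
large <- lm(y ~ x1 + g)    # adds the factor, analogous to adding room_type

# F-test of the null hypothesis that all coefficients on g are zero
f_test <- anova(small, large)
f_test
```

Listing the smaller model first gives a positive Df and Sum of Sq; the F statistic and p-value are the same in either order.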
Our dataset contains many more variables. We decided to explore further variables to determine whether they are significant predictors of price_4_nights.
We started with the number of bedrooms, the number of beds, and the size of the property (accommodates).
#Exploring further variables in regression
model3 <- lm(logprice4nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
bedrooms +
beds +
accommodates ,
data = logpricefor2)
#displaying regression estimates
summary(model3)
Call:
lm(formula = logprice4nights ~ prop_type_simplified + number_of_reviews +
review_scores_rating + bedrooms + beds + accommodates, data = logpricefor2)
Residuals:
Min 1Q Median 3Q Max
-3.7322 -0.3868 -0.0510 0.3335 5.8590
Coefficients:
Estimate Std. Error
(Intercept) 8.9399683 0.0589011
prop_type_simplifiedEntire rental unit -0.0968416 0.0361957
prop_type_simplifiedOther -0.0755567 0.0410870
prop_type_simplifiedPrivate room in rental unit -0.5066357 0.0460417
prop_type_simplifiedPrivate room in residential home -0.7395637 0.0532870
number_of_reviews -0.0008880 0.0001763
review_scores_rating 0.0098442 0.0094550
bedrooms 0.2801877 0.0147588
beds -0.0617795 0.0072554
accommodates 0.1348535 0.0088166
t value Pr(>|t|)
(Intercept) 151.779 < 2e-16 ***
prop_type_simplifiedEntire rental unit -2.676 0.00748 **
prop_type_simplifiedOther -1.839 0.06597 .
prop_type_simplifiedPrivate room in rental unit -11.004 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -13.879 < 2e-16 ***
number_of_reviews -5.036 4.86e-07 ***
review_scores_rating 1.041 0.29784
bedrooms 18.984 < 2e-16 ***
beds -8.515 < 2e-16 ***
accommodates 15.295 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5943 on 6914 degrees of freedom
(4623 observations deleted due to missingness)
Multiple R-squared: 0.2831, Adjusted R-squared: 0.2821
F-statistic: 303.3 on 9 and 6914 DF, p-value: < 2.2e-16
We then converted the character bathrooms_text variable into a numeric variable and added it to model 3 to determine whether it is a significant predictor.
#Creating a table with frequency of each bathroom type
bathroom_freq <- table(logpricefor2$bathrooms_text)
View(bathroom_freq)
#Extracting numeric values from bathroom text
logpricefor2_bathroom <- logpricefor2 %>%
#extracting digits from bathroom text
mutate(bathrooms_numeric = str_extract(bathrooms_text,"[[:digit:]]+")) %>%
#Parsing for numbers
mutate(bathrooms_numeric = parse_number(bathrooms_numeric))
#Adding bathroom numbers to the regression model
model4 <- lm(logprice4nights ~
prop_type_simplified +
number_of_reviews +
review_scores_rating +
bedrooms +
beds +
accommodates +
bathrooms_numeric ,
data= logpricefor2_bathroom)
#Displaying Regression estimates
summary(model4)
Call:
lm(formula = logprice4nights ~ prop_type_simplified + number_of_reviews +
review_scores_rating + bedrooms + beds + accommodates + bathrooms_numeric,
data = logpricefor2_bathroom)
Residuals:
Min 1Q Median 3Q Max
-5.1492 -0.3844 -0.0471 0.3321 5.8724
Coefficients:
Estimate Std. Error
(Intercept) 8.8997067 0.0591348
prop_type_simplifiedEntire rental unit -0.0948874 0.0361022
prop_type_simplifiedOther -0.0942160 0.0411347
prop_type_simplifiedPrivate room in rental unit -0.5265953 0.0461533
prop_type_simplifiedPrivate room in residential home -0.7678637 0.0533739
number_of_reviews -0.0008593 0.0001760
review_scores_rating 0.0115643 0.0094352
bedrooms 0.2537917 0.0154107
beds -0.0634780 0.0072426
accommodates 0.1257527 0.0089453
bathrooms_numeric 0.0846850 0.0146073
t value Pr(>|t|)
(Intercept) 150.499 < 2e-16 ***
prop_type_simplifiedEntire rental unit -2.628 0.0086 **
prop_type_simplifiedOther -2.290 0.0220 *
prop_type_simplifiedPrivate room in rental unit -11.410 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -14.387 < 2e-16 ***
number_of_reviews -4.883 1.07e-06 ***
review_scores_rating 1.226 0.2204
bedrooms 16.469 < 2e-16 ***
beds -8.765 < 2e-16 ***
accommodates 14.058 < 2e-16 ***
bathrooms_numeric 5.797 7.03e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5928 on 6897 degrees of freedom
(4639 observations deleted due to missingness)
Multiple R-squared: 0.2867, Adjusted R-squared: 0.2856
F-statistic: 277.2 on 10 and 6897 DF, p-value: < 2.2e-16
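The pattern `[[:digit:]]+` used to build bathrooms_numeric keeps only the first run of digits, so a value such as "1.5 baths" would be read as 1. Whether such values occur in this dataset is an assumption here; the strings below are hypothetical examples, and `extract_first()` is a helper written for this sketch in base R:

```r
# Hypothetical examples of the bathrooms_text format (assumed, not taken from the data)
bathrooms_text <- c("1 bath", "1.5 baths", "2 shared baths", "Half-bath")

# Helper for this sketch: first match of `pattern` in each string, NA where there is none
extract_first <- function(x, pattern) {
  m   <- regexpr(pattern, x)
  out <- rep(NA_character_, length(x))
  out[m != -1] <- regmatches(x, m)
  as.numeric(out)
}

# Digits only, as in the pattern used above: "1.5 baths" becomes 1
digits_only <- extract_first(bathrooms_text, "[[:digit:]]+")

# Allowing an optional decimal part keeps half bathrooms: "1.5 baths" becomes 1.5
with_decimals <- extract_first(bathrooms_text, "[[:digit:]]+(\\.[[:digit:]]+)?")

digits_only
with_decimals
```

Text-only entries such as "Half-bath" come out as NA under both patterns and would need separate handling.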
#Running a Variance Inflation Factor on the model created above
car::vif(model4)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.124220 4 1.014744
number_of_reviews 1.017573 1 1.008748
review_scores_rating 1.016955 1 1.008442
bedrooms 2.615285 1 1.617184
beds 2.746143 1 1.657149
accommodates 3.963503 1 1.990855
bathrooms_numeric 1.737532 1 1.318155
After fitting the model, we computed the Variance Inflation Factors to check whether the model suffers from multicollinearity and did not find any such problem. We did notice that review score rating consistently has a p-value greater than 0.05. We believe that this is likely due to both number of reviews and review score rating being present in the same model, as people are likely to leave more reviews if they have negative opinions of the property.
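For a single numeric predictor, the variance inflation factor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the others. A sketch of that calculation on simulated predictors (the data and the strength of the relationship are made up for illustration):

```r
set.seed(42)
n <- 500
bedrooms     <- rpois(n, 2) + 1
accommodates <- 2 * bedrooms + rpois(n, 1)   # deliberately related to bedrooms
rating       <- rnorm(n, mean = 4.7, sd = 0.2)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j on the others
r2_bedrooms  <- summary(lm(bedrooms ~ accommodates + rating))$r.squared
vif_bedrooms <- 1 / (1 - r2_bedrooms)
vif_bedrooms
```

The response variable does not enter the calculation at all: the VIF depends only on how the predictors relate to each other. For a term with one degree of freedom, the GVIF reported by `car::vif()` equals this ordinary VIF.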
We then decided to explore the effects of the host being a superhost, whether the property is instantly bookable, and the property's availability over the next 30 days (availability_30). We ran two models with these variables to decide whether we should keep number of reviews or review score rating.
#Adding new predictors to model and removing review score rating
model5 <- lm(logprice4nights ~
prop_type_simplified +
number_of_reviews +
bedrooms +
beds +
accommodates +
bathrooms_numeric +
host_is_superhost +
instant_bookable +
availability_30 ,
data= logpricefor2_bathroom)
#displaying regression estimates
summary(model5)
Call:
lm(formula = logprice4nights ~ prop_type_simplified + number_of_reviews +
bedrooms + beds + accommodates + bathrooms_numeric + host_is_superhost +
instant_bookable + availability_30, data = logpricefor2_bathroom)
Residuals:
Min 1Q Median 3Q Max
-5.3084 -0.4040 -0.0674 0.3356 5.9906
Coefficients:
Estimate Std. Error
(Intercept) 8.9572488 0.0360313
prop_type_simplifiedEntire rental unit -0.0937053 0.0307103
prop_type_simplifiedOther -0.1153607 0.0349925
prop_type_simplifiedPrivate room in rental unit -0.6044183 0.0392059
prop_type_simplifiedPrivate room in residential home -0.7592464 0.0461346
number_of_reviews -0.0009400 0.0001912
bedrooms 0.1227879 0.0093417
beds -0.0400995 0.0048689
accommodates 0.1302920 0.0067425
bathrooms_numeric 0.0963275 0.0125347
host_is_superhostTRUE -0.0015915 0.0162011
instant_bookableTRUE -0.0331453 0.0136200
availability_30 0.0080069 0.0005698
t value Pr(>|t|)
(Intercept) 248.597 < 2e-16 ***
prop_type_simplifiedEntire rental unit -3.051 0.002285 **
prop_type_simplifiedOther -3.297 0.000982 ***
prop_type_simplifiedPrivate room in rental unit -15.417 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -16.457 < 2e-16 ***
number_of_reviews -4.917 8.95e-07 ***
bedrooms 13.144 < 2e-16 ***
beds -8.236 < 2e-16 ***
accommodates 19.324 < 2e-16 ***
bathrooms_numeric 7.685 1.68e-14 ***
host_is_superhostTRUE -0.098 0.921746
instant_bookableTRUE -2.434 0.014969 *
availability_30 14.052 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.6535 on 9479 degrees of freedom
(2055 observations deleted due to missingness)
Multiple R-squared: 0.2368, Adjusted R-squared: 0.2359
F-statistic: 245.1 on 12 and 9479 DF, p-value: < 2.2e-16
#Adding the new predictors (without host_is_superhost) to the model and removing number of reviews
model6 <- lm(logprice4nights~ prop_type_simplified + review_scores_rating + bedrooms + beds + accommodates + bathrooms_numeric + instant_bookable + availability_30 , data= logpricefor2_bathroom)
#displaying regression estimates
summary(model6)
Call:
lm(formula = logprice4nights ~ prop_type_simplified + review_scores_rating +
bedrooms + beds + accommodates + bathrooms_numeric + instant_bookable +
availability_30, data = logpricefor2_bathroom)
Residuals:
Min 1Q Median 3Q Max
-5.1817 -0.3686 -0.0535 0.3203 6.0677
Coefficients:
Estimate Std. Error
(Intercept) 8.712996 0.060944
prop_type_simplifiedEntire rental unit -0.095423 0.035616
prop_type_simplifiedOther -0.117613 0.040611
prop_type_simplifiedPrivate room in rental unit -0.585803 0.045733
prop_type_simplifiedPrivate room in residential home -0.805900 0.052730
review_scores_rating 0.021160 0.009338
bedrooms 0.254587 0.015208
beds -0.062975 0.007145
accommodates 0.125169 0.008815
bathrooms_numeric 0.080815 0.014420
instant_bookableTRUE -0.039415 0.014449
availability_30 0.008223 0.000573
t value Pr(>|t|)
(Intercept) 142.967 < 2e-16 ***
prop_type_simplifiedEntire rental unit -2.679 0.00740 **
prop_type_simplifiedOther -2.896 0.00379 **
prop_type_simplifiedPrivate room in rental unit -12.809 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -15.284 < 2e-16 ***
review_scores_rating 2.266 0.02348 *
bedrooms 16.740 < 2e-16 ***
beds -8.813 < 2e-16 ***
accommodates 14.200 < 2e-16 ***
bathrooms_numeric 5.604 2.17e-08 ***
instant_bookableTRUE -2.728 0.00639 **
availability_30 14.351 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5848 on 6896 degrees of freedom
(4639 observations deleted due to missingness)
Multiple R-squared: 0.3058, Adjusted R-squared: 0.3047
F-statistic: 276.2 on 11 and 6896 DF, p-value: < 2.2e-16
#Creating a table with coefficients and statistical data of each regression model we created
huxreg(model1, model2, model3, model4, model5, model6)
| | (1) | (2) | (3) | (4) | (5) | (6) |
|---|---|---|---|---|---|---|
| (Intercept) | 9.621 *** | 9.688 *** | 8.940 *** | 8.900 *** | 8.957 *** | 8.713 *** |
| | (0.058) | (0.057) | (0.059) | (0.059) | (0.036) | (0.061) |
| prop_type_simplifiedEntire rental unit | -0.085 * | -0.086 * | -0.097 ** | -0.095 ** | -0.094 ** | -0.095 ** |
| | (0.036) | (0.035) | (0.036) | (0.036) | (0.031) | (0.036) |
| prop_type_simplifiedOther | 0.055 | 0.231 *** | -0.076 | -0.094 * | -0.115 *** | -0.118 ** |
| | (0.041) | (0.042) | (0.041) | (0.041) | (0.035) | (0.041) |
| prop_type_simplifiedPrivate room in rental unit | -0.619 *** | -0.042 | -0.507 *** | -0.527 *** | -0.604 *** | -0.586 *** |
| | (0.047) | (0.074) | (0.046) | (0.046) | (0.039) | (0.046) |
| prop_type_simplifiedPrivate room in residential home | -0.716 *** | -0.138 | -0.740 *** | -0.768 *** | -0.759 *** | -0.806 *** |
| | (0.056) | (0.080) | (0.053) | (0.053) | (0.046) | (0.053) |
| number_of_reviews | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | |
| | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
| review_scores_rating | -0.001 | -0.014 | 0.010 | 0.012 | | 0.021 * |
| | (0.010) | (0.010) | (0.009) | (0.009) | | (0.009) |
| room_typeHotel room | | 0.074 | | | | |
| | | (0.086) | | | | |
| room_typePrivate room | | -0.582 *** | | | | |
| | | (0.058) | | | | |
| room_typeShared room | | -1.437 *** | | | | |
| | | (0.088) | | | | |
| bedrooms | | | 0.280 *** | 0.254 *** | 0.123 *** | 0.255 *** |
| | | | (0.015) | (0.015) | (0.009) | (0.015) |
| beds | | | -0.062 *** | -0.063 *** | -0.040 *** | -0.063 *** |
| | | | (0.007) | (0.007) | (0.005) | (0.007) |
| accommodates | | | 0.135 *** | 0.126 *** | 0.130 *** | 0.125 *** |
| | | | (0.009) | (0.009) | (0.007) | (0.009) |
| bathrooms_numeric | | | | 0.085 *** | 0.096 *** | 0.081 *** |
| | | | | (0.015) | (0.013) | (0.014) |
| host_is_superhostTRUE | | | | | -0.002 | |
| | | | | | (0.016) | |
| instant_bookableTRUE | | | | | -0.033 * | -0.039 ** |
| | | | | | (0.014) | (0.014) |
| availability_30 | | | | | 0.008 *** | 0.008 *** |
| | | | | | (0.001) | (0.001) |
| N | 8420 | 8420 | 6924 | 6908 | 9492 | 6908 |
| R2 | 0.062 | 0.099 | 0.283 | 0.287 | 0.237 | 0.306 |
| logLik | -8480.079 | -8312.351 | -6217.026 | -6183.779 | -9424.350 | -6089.866 |
| AIC | 16976.159 | 16646.702 | 12456.052 | 12391.558 | 18876.700 | 12205.731 |
*** p < 0.001; ** p < 0.01; * p < 0.05.
We ultimately decided to use model6, as people are more likely to be influenced by the review score rating than by the number of reviews. Furthermore, this model has the highest \(R^2\) (0.306) among the 6 models. The final model contains 8 explanatory variables, and all of their coefficients have a p-value less than 0.05.
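The six models are fitted to between 6,908 and 9,492 rows, because `lm()` drops any row with a missing value in a variable the model uses. Fit statistics such as AIC are most directly comparable when the models use the same rows. A hedged sketch on simulated data (all names and numbers are made up) of restricting both models to a common set of complete cases:

```r
set.seed(7)
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 1 + 0.5 * df$x1 + 0.3 * df$x2 + rnorm(n)
df$x2[sample(n, 60)] <- NA    # make one predictor partly missing

# Keep only rows that are complete for every variable used by either model
common <- df[complete.cases(df[, c("y", "x1", "x2")]), ]

fit_a <- lm(y ~ x1,      data = common)
fit_b <- lm(y ~ x1 + x2, data = common)

c(nobs(fit_a), nobs(fit_b))   # both models now use the same rows
AIC(fit_a, fit_b)
```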
To check whether the residuals of our model are approximately normally distributed, we used a Q-Q plot.
#Creating a Q-Q plot
autoplot(model6)[2]
As per the plot, our model has most of the residuals packed in the middle, with fat tails at both ends.
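To illustrate what fat tails look like on a Q-Q plot, the sketch below uses simulated values from a t distribution with 3 degrees of freedom as stand-in residuals (an assumption made for illustration, not the residuals of model6) and base R's `qqnorm()` and `qqline()`:

```r
set.seed(3)
# Hypothetical residuals: a t distribution with 3 degrees of freedom has fat tails
res <- rt(1000, df = 3)

# Theoretical normal quantiles (x) against sample quantiles (y), without drawing
qq <- qqnorm(res, plot.it = FALSE)

# Draw the plot with a reference line through the quartiles
qqnorm(res)
qqline(res)
```

With fat tails, the points at both ends bend away from the reference line while the middle stays close to it.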
We ran the Variance Inflation Factor again to ensure the final model does not suffer from collinear variables.
#Running a Variance Inflation Factor on our final model
car::vif(model6)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.155113 4 1.018188
review_scores_rating 1.023347 1 1.011606
bedrooms 2.616805 1 1.617654
beds 2.746113 1 1.657140
accommodates 3.954256 1 1.988531
bathrooms_numeric 1.739646 1 1.318956
instant_bookable 1.008725 1 1.004353
availability_30 1.049392 1 1.024398
After running VIF, we found that the final model does not suffer from collinear variables, as the GVIF is less than 5 for all variables.
#Filtering the dataset for our criteria for staying in Buenos Aires
listings_filtered <- logpricefor2_bathroom %>%
  filter(prop_type_simplified %in% c("Private room in rental unit", "Private room in residential home"),
         number_of_reviews >= 10,
         review_scores_rating >= 4.5)
#Finding the predicted values (on the log scale) for the filtered listings
model_predict <- predict(model6, newdata = listings_filtered, interval = "prediction")
#Finding the median predicted cost to stay, back-transformed from the log scale
Cost_to_stay <- exp(median(model_predict[,1],
                           #ensuring NA values are not used for the median
                           na.rm = TRUE))
Cost_to_stay
[1] 7243.908
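The prediction step above can be illustrated on a small stand-in example. The sketch below assumes a log-linear model fitted on the built-in mtcars data; `toy_model` and `new_cars` are hypothetical names, not objects from our analysis. It shows the structure of what `predict()` returns with `interval = "prediction"` and how the values are brought back from the log scale.

```r
# Stand-in log-linear model (not model6)
toy_model <- lm(log(mpg) ~ wt + hp, data = mtcars)

# Hypothetical new observations to predict for
new_cars <- data.frame(wt = c(2.5, 3.0, 3.5), hp = c(100, 150, 200))

# predict() returns a matrix with columns fit, lwr and upr, all on the log scale
pred_log <- predict(toy_model, newdata = new_cars, interval = "prediction")
colnames(pred_log)

# exp() brings the fitted values and both interval bounds back to the original scale
pred_level <- exp(pred_log)

# Median of the fitted values, back-transformed as in the calculation above
median_level <- exp(median(pred_log[, "fit"], na.rm = TRUE))
median_level
```

Because `exp()` is monotonic, back-transforming the bounds of a prediction interval gives a valid interval on the original scale.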
#Creating a function that will create the confidence interval for us
confidence_interval <- function(vector, interval) {
# Standard deviation of sample
vec_sd <- sd(vector)
# Sample size
n <- length(vector)
# Mean of sample
vec_mean <- mean(vector)
# Error according to t distribution
error <- qt((interval + 1)/2, df = n - 1) * vec_sd / sqrt(n)
# Confidence interval as a vector
result <- c("lower" = vec_mean - error, "upper" = vec_mean + error)
return(result)
}
#Finding the 95% confidence interval of the log of price for 4 nights
#Note: model_predict[1,] selects the first row (fit, lwr and upr of the first listing), not the column of fitted values model_predict[,1]
CI_dependent <- confidence_interval(model_predict[1,], .95)
#Finding the actual value by using the exponential function
CI_lower_dependent <- exp(CI_dependent[1])
CI_upper_dependent <- exp(CI_dependent[2])
CI_upper_dependent
upper
940615.6
CI_lower_dependent
lower
3071.019
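We estimate that the median predicted price for 4 nights in a private room in Buenos Aires is 7,244 Argentinian Pesos. The 95% confidence interval computed above runs from 3,071 to 940,616 Argentinian Pesos. It is this wide because it was computed from only three values: the fitted value and the lower and upper prediction bounds of the first listing.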
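The sketch below illustrates, on simulated numbers only, how much the interval depends on whether the function receives one row of the prediction matrix or the whole column of fitted values. `toy_pred` is a hypothetical matrix shaped like the output of `predict()`, and the `confidence_interval` function is repeated from above so that the example is self-contained. It is not a re-run of our analysis and the numbers are not Airbnb results.

```r
# Same t-based interval function as in the analysis above
confidence_interval <- function(vector, interval) {
  vec_sd <- sd(vector)
  n <- length(vector)
  vec_mean <- mean(vector)
  error <- qt((interval + 1) / 2, df = n - 1) * vec_sd / sqrt(n)
  c("lower" = vec_mean - error, "upper" = vec_mean + error)
}

# Simulated stand-in for a prediction matrix on the log scale (assumed values)
set.seed(1)
fit <- rnorm(50, mean = 9, sd = 0.3)
toy_pred <- cbind(fit = fit, lwr = fit - 1, upr = fit + 1)

# One row: only 3 values (fit, lwr, upr), so the interval is very wide
row_ci <- confidence_interval(toy_pred[1, ], 0.95)

# The fit column: 50 fitted values, so the interval for their mean is much narrower
col_ci <- confidence_interval(toy_pred[, "fit"], 0.95)

# Back-transform both intervals from the log scale
exp(row_ci)
exp(col_ci)
```

With only three values and two degrees of freedom, the t quantile is large and the interval balloons once it is exponentiated. An interval based on the column of fitted values describes the mean predicted log price across the filtered listings instead.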
Our multiple linear regression model has an intercept of 8.712. Since our base case is a condominium, this is the expected value of the natural logarithm of price_4_nights for a condominium when all the other explanatory variables are held at 0. Taking the exponential of this figure gives 6,075.38 Argentinian Pesos, equivalent to £44.78. All of the property type subcategories have negative coefficients, so any deviation from the base case property type (condominium) is associated with a decrease in the price for 4 nights. The largest effect is seen in properties classified as a private room in a residential home, with a coefficient of -0.806. This makes intuitive sense: the reduced privacy of a room in a residential home could make it less attractive to travellers, and hosts could compensate for this by charging a lower price.
The review score rating has a positive relationship with the price for 4 nights. Because the dependent variable is in logs, a 1 unit increase in review score rating is associated with an increase of approximately 2.11% in the price for 4 nights. A positive relationship is also observed with the number of bedrooms: the coefficient is 0.255, meaning that an extra bedroom raises the log of price by 0.255, which is roughly a 25.5% increase in the price for 4 nights (about 29% when computed exactly as exp(0.255) - 1). This is an expected result, as prices are likely to increase with the size of the property and its capacity for people to sleep in.
Interestingly, the coefficient for the number of beds is negative at -0.063, so an extra bed is associated with a decrease of roughly 6.3% in the price for 4 nights, holding the other variables constant. A positive relationship might be expected, as seen with the number of bedrooms. However, the negative relationship might arise because a greater number of beds for a given number of bedrooms and guests points to small single beds, which may be less desirable for travellers who are willing to pay higher prices.
As might be expected, the number of people the property can accommodate is positively associated with price. With a coefficient of 0.125, the capacity to accommodate an extra person is associated with an increase of roughly 12.5% in the price for 4 nights. An extra bathroom is associated with an increase of roughly 8.08% in the price for 4 nights.
It might seem surprising that properties with an instant book feature are negatively related with price, with a coefficient of -0.039 (roughly 3.9% lower), since this feature provides convenience for customers. It is likely that the negative relationship exists because properties classed as more luxurious are more selective about who they allow to stay and therefore do not offer the instant book feature, and these properties are also likely to command a higher price.
The availability of the property 30 days in the future has a weak positive association with the price for 4 nights. With a coefficient of 0.008, each additional unit of availability_30 is associated with a price that is roughly 0.8% higher.
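Since the dependent variable is the natural logarithm of price_4_nights, each coefficient can be converted into a percentage change in price with 100 * (exp(b) - 1). The sketch below applies this to the rounded coefficients quoted in the text above; the vector names are labels chosen for this illustration rather than the exact term names in model6, and the review score and bathroom coefficients are inferred from the quoted percentages.

```r
# Rounded coefficients as quoted in the interpretation above
coefs <- c(
  private_room_residential_home = -0.806,
  review_scores_rating          =  0.0211,
  bedrooms                      =  0.255,
  beds                          = -0.063,
  accommodates                  =  0.125,
  bathrooms                     =  0.0808,
  instant_bookable              = -0.039,
  availability_30               =  0.008
)

# Exact percentage change in price for a one-unit change in each variable
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 1)

# Approximation used in the text: 100 * b, which is close only for small b
approx_pct <- 100 * coefs
round(approx_pct - pct_change, 1)

# Back-transformed intercept: expected price for the base case with all other variables at 0
intercept_level <- exp(8.712)
intercept_level
```

For small coefficients such as availability_30 the two versions are almost identical, while for larger ones such as bedrooms or the private room category the exact figure differs noticeably from 100 * b.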